Method and apparatus for fast audio search

ABSTRACT

According to embodiments of the subject matter disclosed in this application, a large audio database in a multiprocessor system may be searched for a target audio clip using a robust and parallel search method. The large audio database may be partitioned into a number of smaller groups, which are dynamically scheduled to available processors in the system. Processors may process the scheduled groups in parallel by partitioning each group into smaller segments, extracting acoustic features from the segments, and modeling the segments using a common component Gaussian mixture model ("CCGMM"). One processor may also extract acoustic features from the target audio clip and model it using the CCGMM. A Kullback-Leibler (KL) distance may further be computed between the model of the target audio clip and the model of each segment. Based on the KL distance, a segment may be identified as a match for the target audio clip, and/or a number of following segments may be skipped.

BACKGROUND

1. Field

This disclosure relates generally to signal processing and multimedia applications, and more specifically but not exclusively, to methods and apparatus for fast audio search and audio fingerprinting.

2. Description

Audio search (e.g., searching a large audio stream for an audio clip, even if the large audio stream is corrupted or distorted) has many applications, including analysis of broadcast music and commercials, copyright management over the Internet, and finding metadata for unlabeled audio clips. A typical audio search system is serial and designed for single-processor systems. It normally takes a long time for such a search system to search for a target audio clip in a large audio stream. In many cases, however, an audio search system is required to work efficiently on large audio databases, e.g., to search large databases in a very short time (e.g., close to real time). Additionally, an audio database may be partially or entirely distorted, corrupted, and/or compressed. This requires that an audio search system be robust enough to identify those audio segments that are the same as the target audio clip, even if those segments are distorted, corrupted, and/or compressed. Thus, it is desirable to have an audio search system which can quickly and robustly search large audio databases for a target audio clip.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the disclosed subject matter will become apparent from the following detailed description of the subject matter in which:

FIG. 1 shows one example computing system where robust and parallel audio search may be performed using an audio search module;

FIG. 2 shows another example computing system where robust and parallel audio search may be performed using an audio search module;

FIG. 3 shows yet another example computing system where robust and parallel audio search may be performed using an audio search module;

FIG. 4 is a block diagram of an example audio search module that performs robust audio search;

FIG. 5 is an example illustrating how the robust audio search module shown in FIG. 4 works;

FIG. 6 is a block diagram of an example audio search module that performs robust and parallel audio search in a multiprocessor system;

FIGS. 7A, 7B, and 7C illustrate a method of partitioning a large audio database into smaller groups for robust and parallel audio search in a multiprocessor system; and

FIG. 8 is pseudo code illustrating an example process for performing robust and parallel audio search in a multiprocessor system.

DETAILED DESCRIPTION

According to embodiments of the subject matter disclosed in this application, a large audio stream or a large audio database in a multiprocessor system may be searched for a target audio clip using a robust and parallel search method. The large audio database may be partitioned into a number of smaller groups. These smaller groups may be dynamically scheduled to be processed by available processors or processing cores in the multiprocessor system. Processors or processing cores may process the scheduled groups in parallel by partitioning each group into smaller segments, extracting acoustic features from the segments, and modeling the segments using a common component Gaussian mixture model ("CCGMM"). The length of these segments may be the same as the length of the target audio clip. Before processing any group, one processor or processing core may extract acoustic features from the target audio clip and model it using the CCGMM. A Kullback-Leibler (KL) or KL-max distance may further be computed between the model of the target audio clip and the model of each segment of a group. If the distance is equal to or smaller than a predetermined value, the corresponding segment is identified as the target audio clip.

If the distance is larger than a predetermined value, the processor or processing core may skip a certain number of segments and continue searching for the target audio clip. Once a processor or processing core finishes searching a group, a new group may be given to it for processing, until all of the groups are searched. The size of the groups may be determined in such a way as to reduce load imbalance and overlapped computation. Furthermore, Input/Output (I/O) may be optimized to improve the efficiency of parallel processing of audio groups by multiple processors or processing cores.

Reference in the specification to “one embodiment” or “an embodiment” of the disclosed subject matter means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed subject matter. Thus, the appearances of the phrase “in one embodiment” appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

FIG. 1 shows one example computing system 100 where robust and parallel audio search may be performed using an audio search module 120. Computing system 100 may comprise one or more processors 110 coupled to a system interconnect 115. Processor 110 may have multiple or many processing cores (for brevity of description, the term “multiple cores” will be used hereinafter to include both multiple processing cores and many processing cores). Processor 110 may include an audio search module 120 to conduct robust and parallel audio search by multiple cores. The audio search module may comprise several components such as a partitioning mechanism, a scheduler, and multiple audio searchers (see the more detailed description for FIGS. 4-6 below). One or more components of the audio search module may be located in one core with others in another core.

The audio search module may first partition a large audio database into multiple smaller groups or a large audio stream into smaller, partially overlapped substreams. Second, one core may process an audio clip to be searched for (“target audio clip”) to establish a model for the target audio clip. Meanwhile, the audio search module dynamically schedules smaller audio groups/substreams to multiple cores, which partition each group/substream into segments and establish a model for each audio segment, in parallel. The size of each segment may be equal to the size of the target audio clip. A Gaussian mixture model (“GMM”) with multiple Gaussian components, which are common to all of the audio segments including both the target audio clip and the audio database/stream, may be used for modeling each audio segment and the target audio clip. Once a model is established for an audio segment, a Kullback-Leibler (“KL”) or KL-max distance may be computed between the segment model and the target audio clip model. If the distance is not larger than a predetermined value, the audio segment may be identified as the target audio clip. The search process may continue until all audio groups/substreams are processed.

The computing system 100 may also include a chipset 130 coupled to the system interconnect 115. Chipset 130 may include one or more integrated circuit packages or chips. Chipset 130 may comprise one or more device interfaces 135 to support data transfers to and/or from other components 160 of the computing system 100 such as, for example, BIOS firmware, keyboards, mice, storage devices, network interfaces, etc. Chipset 130 may be coupled to a Peripheral Component Interconnect (PCI) bus 170. Chipset 130 may include a PCI bridge 145 that provides an interface to the PCI bus 170. The PCI bridge 145 may provide a data path between the processor 110 as well as other components 160, and peripheral devices such as, for example, an audio device 180 and a disk drive 190. Although not shown, other devices may also be coupled to the PCI bus 170.

Additionally, chipset 130 may comprise a memory controller 125 that is coupled to a main memory 150. The main memory 150 may store data and sequences of instructions that are executed by multiple cores of the processor 110 or any other device included in the system. The memory controller 125 may access the main memory 150 in response to memory transactions associated with multiple cores of the processor 110, and other devices in the computing system 100. In one embodiment, memory controller 125 may be located in processor 110 or some other circuitry. The main memory 150 may comprise various memory devices that provide addressable storage locations which the memory controller 125 may read data from and/or write data to. The main memory 150 may comprise one or more different types of memory devices such as Dynamic Random Access Memory (DRAM) devices, Synchronous DRAM (SDRAM) devices, Double Data Rate (DDR) SDRAM devices, or other memory devices.

FIG. 2 shows another example computing system 200 where robust and parallel audio search may be performed using an audio search module 240. System 200 may comprise multiple processors such as processor0 220A. One or more processors in system 200 may have many cores. System 200 may include an audio search module 240 to conduct robust and parallel audio search by multiple cores. The audio search module may comprise several components such as a partitioning mechanism, a scheduler, and multiple audio searchers (see the more detailed description for FIGS. 4-6 below). One or more components of the audio search module may be located in one core with others in another core. Processors in system 200 may be connected to each other using a system interconnect 210. System interconnect 210 may be a Front Side Bus (FSB). Each processor may be connected to Input/Output (I/O) devices as well as memory 230 through the system interconnect. All of the cores may receive audio data from memory 230.

FIG. 3 shows yet another example computing system 300 where robust and parallel audio search may be performed using an audio search module 340. In system 300, the system interconnect 310 that connects multiple processors (e.g., 320A, 320B, 320C, and 320D) is a links-based point-to-point connection. Each processor may connect to the system interconnect through a links hub (e.g., 330A, 330B, 330C, and 330D). In some embodiments, a links hub may be co-located with a memory controller, which coordinates traffic to/from a system memory. One or more processors may have many cores. System 300 may include an audio search module 340 to conduct robust and parallel audio search by multiple cores. The audio search module may comprise several components such as a partitioning mechanism, a scheduler, and multiple audio searchers (see the more detailed description for FIGS. 4-6 below). One or more components of the audio search module may be located in one core with others in another core. Each processor/core in system 300 may be connected to a shared memory (not shown in the figure) through the system interconnect. All of the cores may receive audio data from the shared memory.

In FIGS. 2 and 3, the audio search module (i.e., 240 and 340) may first partition a large audio database into multiple smaller groups or a large audio stream into smaller, partially overlapped substreams. Second, one core may process an audio clip to be searched for (“target audio clip”) to establish a model for the target audio clip. Meanwhile, the audio search module dynamically schedules smaller audio groups/substreams to multiple cores, which partition each group/substream into segments and establish a model for each audio segment, in parallel. The size of each segment may be equal to the size of the target audio clip. A Gaussian mixture model (“GMM”) with multiple Gaussian components, which are common to all of the audio segments including both the target audio clip and the audio database/stream, may be used for modeling each audio segment and the target audio clip. Once a model is established for an audio segment, a Kullback-Leibler (“KL”) or KL-max distance may be computed between the segment model and the target audio clip model. If the distance is not larger than a predetermined value, the audio segment may be identified as the target audio clip. The search process may continue until all audio groups/substreams are processed.

FIG. 4 is a block diagram of an example audio search module 400 that performs robust audio search. Audio search module 400 comprises a feature extractor 410, a modeling mechanism 420, and a decision maker 430. Feature extractor 410 may receive an input audio stream (e.g., a target audio clip, a substream of a large audio stream, etc.) and extract acoustic features from the input audio stream. When the input audio stream is an audio stream to be searched for the target audio clip, the feature extractor may apply a sliding window to the audio stream to partition it into multiple overlapped segments. The window has the same length as the target audio clip. Each segment of the input audio stream (the target audio clip has only one segment) is further separated into frames. Each frame may have the same length and may overlap with its neighboring frames. For example, in one embodiment, a frame may be 20 milliseconds in length with the overlap between frames being 10 milliseconds. A feature vector may be extracted for each frame, which may include such features as Fourier coefficients, Mel-frequency cepstral coefficients, spectral flatness, and means, variances, and other derivatives thereof. The feature vectors from all of the frames in an audio segment form a feature vector sequence.
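
For concreteness, the frame partitioning and per-frame feature extraction described above may be sketched as follows. This is a minimal illustration, assuming 16 kHz mono samples in a NumPy array; the particular features computed here (a few log-magnitude Fourier coefficients plus spectral flatness) are illustrative stand-ins for the feature set listed above, and all function names are hypothetical.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=20, hop_ms=10):
    """Split a 1-D signal into fixed-length, partially overlapping frames.

    A 20 ms frame with a 10 ms hop reproduces the overlap used in the
    example embodiment; both values are tunable."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    return np.stack([samples[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

def feature_vector(frame, n_coeffs=12):
    """Illustrative per-frame features: a few log-magnitude Fourier
    coefficients plus spectral flatness (geometric/arithmetic mean ratio)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) + 1e-10
    log_mag = np.log(spectrum[:n_coeffs])
    flatness = np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum)
    return np.concatenate([log_mag, [flatness]])

def feature_sequence(samples, sample_rate=16000):
    """Feature vector sequence: one feature vector per frame."""
    return np.stack([feature_vector(f)
                     for f in frame_signal(samples, sample_rate)])
```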

The overlap between two adjacent segments reduces the likelihood of missing a target audio clip that falls between two adjacent segments. The longer the overlap is, the less likely a miss is. In one embodiment, the overlap may be equal to the length of a segment minus the length of a frame to avoid missing any match. However, a longer overlap means more computation. Thus, there should be a balance between the computation load and the likelihood of a miss (e.g., the overlap may be equal to or less than ½ of the segment length). In any case, feature vectors for frames that are overlapped between two segments only need to be extracted once.

Modeling mechanism 420 may establish a model for an audio segment based on its feature vector sequence extracted by feature extractor 410. Depending on what model is used, the modeling mechanism will estimate parameters for the model. In one embodiment, a common component Gaussian mixture model (“CCGMM”) may be used for modeling an audio segment. The CCGMM includes multiple Gaussian components which are common across all of the segments. For each segment, the modeling mechanism estimates a specific set of mixture weights for the common Gaussian components. In another embodiment, other models (e.g., a hidden Markov model) may be used for modeling an audio segment. In one embodiment, only the target audio clip may be modeled, and the feature vector sequence of an audio segment may be directly used to determine whether the audio segment is substantially the same as the target audio clip.

Decision maker 430 may determine whether an audio segment in the input audio stream is sufficiently similar to the target audio clip that the audio segment can be identified as a copy of the target audio clip. To achieve this goal, the decision maker may derive a similarity measure by comparing the model of the audio segment and the model of the target audio clip. In one embodiment, the similarity measure may be a distance computed between the two models. In another embodiment, the similarity measure may be the probability of the audio segment model being the same as the target audio clip model. In yet another embodiment, the similarity measure may be derived by comparing the feature vector sequence of the audio segment with the model of the target audio clip. For example, when a hidden Markov model (“HMM”) is used to model the target audio clip, a Viterbi-based algorithm may be used to compute a likelihood score between the audio segment and the target audio clip, based on the feature vector sequence of the audio segment and the HMM of the target audio clip.

Based on the value of the similarity measure, the decision maker may determine whether an audio segment can be identified as the target audio clip. For example, if the value of the similarity measure is not larger than a predetermined threshold (e.g., when the similarity measure is a distance between the audio segment model and the target audio clip model), the audio segment may be identified as substantially the same as the target audio clip. Similarly, the audio segment may be identified as substantially the same as the target audio clip if the value of the similarity measure is not smaller than a predetermined threshold (e.g., when the similarity measure is a likelihood score of the audio segment being substantially the same as the target audio clip). On the other hand, if an audio segment is found to be substantially different from the target audio clip based on the similarity measure, a certain number of segments immediately following the audio segment may be skipped. The actual number of segments to be skipped will depend on the value of the similarity measure and/or empirical data. Skipping a number of following segments is unlikely to cause a miss of the target audio clip when the similarity measure indicates that the current segment is very different from the target audio clip, because the window used to partition an input audio stream into segments slides forward gradually; as a result, there is continuity of the similarity measure from one segment to the next.

FIG. 5 is an example illustrating how the robust audio search module shown in FIG. 4 works. A target audio clip 510 is received by a feature extractor, which segments it into frames and produces a feature vector sequence (540) at block 530A, with one feature vector per frame. A feature vector may be an x-dimensional vector (wherein x>=1) because the feature vector may include one or more parameters. At block 570A, feature vector sequence 540 may be modeled using a GMM as shown below:

$\begin{matrix}{{P^{(k)}(x)} = {\sum\limits_{i = 1}^{M}{w_{i}^{(k)}{{N( {{x\mu_{i}^{(k)}},\sum\limits_{i}^{(k)}} )}.}}}} & (1)\end{matrix}$

The GMM, $P^{(k)}(x)$, includes M Gaussian components with component weights $w_{i}^{(k)}$, means $\mu_{i}^{(k)}$, and covariances $\Sigma_{i}^{(k)}$, with i=1, 2, . . . , M; wherein k denotes segment k and N(·) denotes a Gaussian distribution. For the target audio clip, there is only one segment, and hence there is no need to use k to identify a segment. For the input audio stream 520, however, there is typically more than one segment, and it is thus desirable to identify the GMM for different segments.

In the example shown in FIG. 5, the Kullback-Leibler (KL) or KL-max distance is used as a similarity measure. To simplify the KL-max distance computation, it is assumed that the GMMs used for all the audio segments share a common set of Gaussian components, i.e., for the i-th Gaussian component, the mean ($\mu_{i}$) and covariance ($\Sigma_{i}$) are the same across different audio segments. As a result, Equation (1) becomes:

$\begin{matrix}{{P^{(k)}(x)} = {\sum\limits_{i = 1}^{M}{w_{i}^{(k)}{{N( {{x\mu_{i}},\sum\limits_{i}} )}.}}}} & (2)\end{matrix}$

For each audio segment, only a set of weights, $w_{i}^{(k)}$, i=1, 2, . . . , M, needs to be estimated for the common Gaussian components. Given a feature vector sequence for segment k, which has T feature vectors, $x_{t}$ (t=1, 2, . . . , T), the weights may be estimated as follows,

$$w_{i}^{(k)} = \frac{1}{T} \sum_{t = 1}^{T} \frac{w_{i}^{(u)}\, N\!\left( x_{t};\, \mu_{i}, \Sigma_{i} \right)}{\sum_{j = 1}^{M} w_{j}^{(u)}\, N\!\left( x_{t};\, \mu_{j}, \Sigma_{j} \right)}, \qquad (3)$$

wherein $w_{i}^{(u)}$ or $w_{j}^{(u)}$ is a universal weight for the i-th or j-th Gaussian component, which may be obtained by experiments based on some sample audio files or be initialized with a random value.
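
Because Equation (3) is a single averaging pass over per-frame component posteriors, it can be sketched compactly. The following is a minimal illustration assuming diagonal covariances and NumPy arrays; the function names and array shapes are assumptions for illustration, not taken from the disclosure.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian N(x; mean, var)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var,
                         axis=-1)

def ccgmm_weights(features, means, variances, universal_weights):
    """Estimate per-segment mixture weights per Equation (3).

    features:          (T, D) feature vector sequence for one segment
    means, variances:  (M, D) shared Gaussian components (diagonal covariance)
    universal_weights: (M,) universal weights w^(u)
    Returns the segment-specific weights w^(k), shape (M,)."""
    # log of w_i^(u) * N(x_t; mu_i, Sigma_i) for every frame t, component i
    log_resp = np.log(universal_weights)[None, :] + np.stack(
        [log_gaussian(features, means[i], variances[i])
         for i in range(len(means))], axis=1)          # shape (T, M)
    # normalize over components j (the denominator of Equation (3))
    log_resp -= np.logaddexp.reduce(log_resp, axis=1, keepdims=True)
    # average the per-frame posteriors over the T frames
    return np.exp(log_resp).mean(axis=0)
```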

An input audio stream 520, which is to be searched for the target audio clip 510, may be received by a feature extractor. At block 530B, the feature extractor partitions the input audio stream into partially overlapped segments. For each segment, the feature extractor further partitions the segment into multiple partially overlapped frames and extracts a feature vector from each frame. Block 560 shows a feature vector sequence for the input audio stream 520 and also illustrates how the audio stream is partitioned into partially overlapped segments. For example, a window with a size equal to the length of the target audio clip may be applied to input audio stream 520. For illustration purposes, a window is shown over the feature vector sequence of the target audio clip to obtain a segment 560A, although there is typically no need to apply a window to the target audio clip since there is only one segment. A shifting window is applied to the input audio stream to obtain multiple partially overlapped segments such as 560B and 560C. The window shifts by time τ from segment 560B to segment 560C, where τ is smaller than the window size.

Each audio segment is modeled using the CCGMM; for example, segment 560B is modeled at block 570B and segment 560C is modeled at block 570C. The models for each segment of input audio stream 520 and for target audio clip 510 have common Gaussian components with different sets of weights. In one embodiment, feature vectors may be extracted from the entire input audio stream frame by frame to produce a long feature vector sequence for the entire input audio stream. A window with a length of N×FL (where N is a positive integer and FL is the frame length) is subsequently applied to the long feature vector sequence. The feature vectors within the window constitute the feature vector sequence of an audio segment, which is used to establish a CCGMM. The window shifts forward by time τ.
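
The windowing of the long feature vector sequence may be sketched as a simple generator; this assumes the shift τ is a whole number of frames (hop_frames) and that the feature sequence is a (T, D) array, both illustrative assumptions.

```python
def segment_feature_sequences(features, frames_per_segment, hop_frames):
    """Slide a window of N frames over the long feature vector sequence,
    shifting by hop_frames (the shift tau expressed in frames), and yield
    the start frame and feature vector sequence of each partially
    overlapped segment."""
    t = 0
    while t + frames_per_segment <= len(features):
        yield t, features[t : t + frames_per_segment]
        t += hop_frames
```

Each yielded window can then be fed to a weight-estimation routine such as the ccgmm_weights sketch above to obtain that segment's CCGMM.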

To determine if a segment is substantially the same as the target audio clip, the KL-max distance may be calculated between the model of the segment and the model of the target audio clip as follows,

$$d_{KL\max} = \max_{i = 1, 2, \ldots, M} \left( w_{i}^{(1)} - w_{i}^{(2)} \right) \log \frac{w_{i}^{(1)}}{w_{i}^{(2)}}. \qquad (4)$$

If the KL-max distance so calculated is below a predetermined threshold, the target audio clip may be considered to be detected. As the window applied to input audio stream 520 shifts forward in time, the distances typically show a certain continuity from one time step to the next. In other words, if the distance is too large, it is unlikely that one or more segments immediately following the current segment match the target audio clip. Thus, depending on the value of the distance, a certain number of immediately following segments in the same audio stream/substream may be skipped from the search.
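
Combining Equation (4) with the skip heuristic gives the core of the decision maker. In the following sketch the two thresholds and the skip schedule are hypothetical tuning values; the disclosure specifies only that the number of skipped segments depends on the value of the distance.

```python
import numpy as np

def kl_max_distance(w1, w2, eps=1e-12):
    """KL-max distance of Equation (4) between two CCGMM weight vectors."""
    w1 = np.asarray(w1) + eps
    w2 = np.asarray(w2) + eps
    return np.max((w1 - w2) * np.log(w1 / w2))

def decide(segment_weights, target_weights,
           match_threshold=0.05, skip_threshold=0.5, max_skip=8):
    """Return (is_match, n_skip) for one segment.

    match_threshold and skip_threshold are hypothetical tuning values;
    the number of skipped segments grows with the distance, capped at
    max_skip."""
    d = kl_max_distance(segment_weights, target_weights)
    if d <= match_threshold:
        return True, 0
    if d > skip_threshold:
        # larger distance -> skip more of the immediately following segments
        return False, min(max_skip, int(d / skip_threshold))
    return False, 0
```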

FIG. 6 is a block diagram of an example audio search module 600 that performs robust and parallel audio search in a multiprocessor system. The audio search module 600 comprises a partitioning mechanism 610, a scheduler 620, an I/O optimizer 630, and a plurality of audio searchers (e.g., 640A, 640N). Partitioning mechanism 610 may partition a large audio stream into multiple smaller substreams and/or a large audio database into multiple smaller groups. FIGS. 7A, 7B, and 7C illustrate a method of partitioning a large audio database into smaller groups for robust and parallel audio search in a multiprocessor system. FIG. 7A shows an example database that contains a single large audio stream 710. The partitioning mechanism may partition audio stream 710 into multiple smaller substreams such as 712, 714, and 716, with each substream constituting a group. The lengths of the substreams may vary from one another, but they are normally uniform for simplicity. To avoid missing any correct detection of a target audio clip, each substream overlaps with its immediately following substream; and the overlap between two adjacent substreams (e.g., 712 and 714, or 714 and 716) should be equal to or longer than FNClip−1 frames, where FNClip is the total number of frames in the target audio clip.

FIG. 7B shows another example database that includes multiple relatively small audio streams (e.g., 720, 725, 730, 735, and 740). In one embodiment, partitioning mechanism 610 may partition the database into multiple smaller groups with each group consisting of only one audio stream. In another embodiment, the partitioning mechanism may partition the database into multiple smaller groups with some groups each consisting of only one audio stream and others each consisting of more than one small audio stream, as illustrated in FIG. 7B. FIG. 7C shows yet another example database that includes some relatively small audio streams (e.g., 750, 755, and 760) as well as a large audio stream (e.g., 770). The partitioning mechanism may put those relatively small audio streams into groups with each group consisting of only one audio stream, or with some groups consisting of only one audio stream (e.g., 750) while others consist of more than one small audio stream (e.g., 755 and 760 may be grouped together). As for a large audio stream such as 770, the partitioning mechanism may partition it into multiple partially overlapped smaller substreams (e.g., 712 and 714) with each substream constituting a group, using the method illustrated in FIG. 7A.

Additionally, the partitioning mechanism partitions a large audio database into groups of proper sizes to reduce the overlapped computation (in the situation where a large audio stream is partitioned into multiple overlapped smaller substreams) and the load imbalance in parallel processing by multiple processors. A smaller group size may result in more overlapped computation, while a larger group size may result in considerable load imbalance. In one embodiment, the group size may be about 25 times the size of the target audio clip.
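
The overlapped partitioning of FIG. 7A, together with the group-size rule of thumb above, can be sketched in terms of frame counts as follows; the function name and the group_factor default are illustrative assumptions.

```python
def partition_stream(n_frames, clip_frames, group_factor=25):
    """Partition a long stream (n_frames frames) into partially overlapped
    substreams, each constituting one schedulable group.

    Adjacent substreams overlap by clip_frames - 1 frames so that no
    alignment of the target clip can fall entirely between two groups;
    each group is roughly group_factor clip lengths long."""
    group_len = group_factor * clip_frames
    overlap = clip_frames - 1
    groups, start = [], 0
    while start < n_frames:
        end = min(start + group_len, n_frames)
        groups.append((start, end))
        if end == n_frames:
            break
        start = end - overlap      # back up so adjacent groups overlap
    return groups
```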

Turning back to FIG. 6, scheduler 620 may dynamically schedule the multiple groups of a large database to multiple processors in the multiprocessor system, with each processor having one group to process at one time. The scheduler periodically checks the availability of processors in the system and assigns an audio group to each available processor to process and search for the target audio clip. If another processor becomes available later, the scheduler may assign one group to this processor. The scheduler also assigns an unsearched audio group to a processor immediately after it finishes searching its previously assigned group, regardless of whether other processors have finished their searching. In fact, even for groups of the same size, searching for the same target audio clip may take different amounts of time on different processors because the number of segments to be skipped may differ from one segment to another. Using dynamic scheduling as outlined above may further reduce load imbalance effectively.
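
Dynamic scheduling of this kind maps naturally onto a shared work queue from which each worker pulls the next unsearched group as soon as it finishes its previous one. Below is a minimal sketch using Python's standard multiprocessing module; search_group is a stand-in for the per-group audio searcher described above, and the structure is an illustration rather than the disclosure's specific scheduler.

```python
import multiprocessing as mp

def search_group(group, target_weights):
    # Stand-in for the per-group audio searcher of FIG. 4: it would model
    # each segment of the group and return any matches found.
    return (group, [])

def worker(group_queue, result_queue, target_weights):
    """Pull groups until a sentinel arrives; search each one in turn."""
    while True:
        group = group_queue.get()
        if group is None:                 # sentinel: no more work
            return
        result_queue.put(search_group(group, target_weights))

def parallel_search(groups, target_weights, n_procs=4):
    """Dynamically dispatch groups: a free worker immediately takes the
    next unsearched group, which keeps load imbalance low."""
    group_queue, result_queue = mp.Queue(), mp.Queue()
    for g in groups:
        group_queue.put(g)
    for _ in range(n_procs):
        group_queue.put(None)             # one sentinel per worker
    procs = [mp.Process(target=worker,
                        args=(group_queue, result_queue, target_weights))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in range(len(groups))]
    for p in procs:
        p.join()
    # On spawn-based platforms, call parallel_search() under
    # an `if __name__ == "__main__":` guard.
    return results
```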

I/O optimizer 630 may optimize I/O traffic on the system interconnect (e.g., the system bus connecting a shared system memory with processors in the system). The I/O optimizer may decide not to load the entire audio database to be searched from disk into memory at the beginning, while the data range for each processor is being defined. Additionally, the I/O optimizer may let each processor read only a portion of its assigned segment from memory at one time. By optimizing the I/O traffic, the I/O optimizer may reduce I/O contention, implement the overlap of I/O operations and computation, and help to improve computation efficiency. As a result, the scalability of audio search can be significantly improved.
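
One plausible way to realize the "read only a portion at a time" idea is a chunked reader that lets computation on one chunk overlap the read of the next. The sketch below assumes raw 16-bit PCM files and fixed-length frames; it illustrates the general approach, not the disclosure's specific I/O scheme.

```python
import numpy as np

def read_in_chunks(path, start_frame, n_frames, frame_len, chunk_frames=1024):
    """Yield an assigned range of a raw 16-bit PCM file one chunk at a
    time, so feature extraction on one chunk can overlap the read of the
    next and the whole database is never resident in memory at once."""
    bytes_per_frame = frame_len * 2            # 16-bit samples
    with open(path, "rb") as f:
        f.seek(start_frame * bytes_per_frame)
        remaining = n_frames
        while remaining > 0:
            n = min(chunk_frames, remaining)
            data = f.read(n * bytes_per_frame)
            if not data:
                break
            samples = np.frombuffer(data, dtype=np.int16)
            usable = (len(samples) // frame_len) * frame_len
            yield samples[:usable].reshape(-1, frame_len)
            remaining -= n
```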

Audio search module 600 also comprises a plurality of audio searchers 640A through 640N. Each audio searcher (e.g., 640A) is located in a processor to process a group assigned to the processor and to search for the target audio clip. Similar to audio search module 400 shown in FIG. 4, an audio searcher includes a feature extractor (e.g., 410), a modeling mechanism (e.g., 420), and a decision maker (e.g., 430). Each audio searcher conducts a serial active search for the target audio clip over the audio group assigned to it by partitioning the audio streams in the audio group into partially overlapped segments with the same length as the target audio clip, extracting a feature vector sequence for each segment, and modeling each segment using a CCGMM as illustrated in Equations (1) through (4). Additionally, the CCGMM for the target audio clip, which is used by all of the audio searchers, needs to be estimated only once, by one of the audio searchers. Each audio searcher computes the KL-max distance between the model for each segment and the model of the target audio clip. Based on the KL-max distance, an audio searcher may determine if the target audio clip is detected. Moreover, each audio searcher may skip a number of segments that follow the current segment if the KL-max distance for the current segment is larger than a threshold.

FIG. 8 is pseudo code illustrating an example process 800 for performing robust and parallel audio search in a multiprocessor system. At line 802, the audio search module may be initialized, e.g., the target audio clip file and the audio database file may be opened, and global parameters may be initialized. At line 804, a large audio database may be partitioned into NG smaller groups as illustrated in FIGS. 7A, 7B, and 7C. At line 806, a model (e.g., a CCGMM) may be established for the target audio clip. At line 808, the NG audio groups may be dynamically scheduled to available processors and parallel processing of the scheduled groups may be started. Line 808 uses one example instruction that sets up a parallel implementation; other parallel implementation instructions may also be used.

Lines 810 through 846 illustrate how each of the NG groups is processed and searched for the target audio clip in parallel by a processor in the multiprocessor system. It is worth noting that, for illustration purposes, the process in lines 812 to 846 is shown as an iteration from the first group to the last group. In practice, if several processors are available, several groups are processed in parallel by these available processors. At line 814, some or all of the audio streams in each group may be further partitioned into NS partially overlapped segments if such streams are longer in time than the target audio clip. Line 816 starts an iterative process for each segment of the group, shown in lines 818 through 832. At line 820, a feature vector sequence (frame by frame) may be extracted from the segment. At line 822, a model (e.g., a CCGMM as shown in Equations (1) to (3)) may be established for the segment. At line 824, a distance (e.g., the KL-max distance as shown in Equation (4)) between the segment model and the target audio clip model may be computed. At line 826, whether the segment matches the target audio clip may be determined based on the distance calculated at line 824 and a predetermined threshold #1. If the distance is less than threshold #1, the segment matches the target audio clip. At line 828, whether a number of following segments (e.g., M segments) in the same audio stream/substream may be skipped from searching may be determined based on the distance calculated at line 824 and a predetermined threshold #2. If the distance is larger than threshold #2, M segments may be skipped from searching. In one embodiment, the number of segments to be skipped may vary depending upon the value of the distance. At line 830, the search results (e.g., the index or starting time of a matching segment in each group) may be stored in an array which is local to the processor that processes the group. At line 842, the search results from the local arrays of all of the processors may be summarized and output to a user.
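
Using the helpers sketched earlier (segment_feature_sequences, ccgmm_weights, and kl_max_distance), the per-segment loop of lines 816 through 832 may be rendered as follows; threshold_1 and threshold_2 are hypothetical tuning values, and the skip schedule is one plausible choice for making the skip count depend on the distance.

```python
def search_group_features(features, target_weights, means, variances,
                          universal_weights, clip_frames, hop_frames,
                          threshold_1=0.05, threshold_2=0.5):
    """Serial active search over one group's feature vector sequence.

    Returns the start frames of segments matching the target audio clip,
    skipping following segments when a segment is clearly different
    (the tests at lines 826 and 828 of FIG. 8)."""
    matches = []
    skip_until = -1
    for start, window in segment_feature_sequences(features, clip_frames,
                                                   hop_frames):
        if start < skip_until:
            continue                          # segment skipped (line 828)
        w = ccgmm_weights(window, means, variances, universal_weights)
        d = kl_max_distance(w, target_weights)
        if d < threshold_1:                   # match found (line 826)
            matches.append(start)
        elif d > threshold_2:
            m = int(d / threshold_2)          # M grows with the distance
            skip_until = start + (m + 1) * hop_frames
    return matches
```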

Using the robust and parallel search strategy outlined in FIG. 8 along with other techniques such as I/O optimization, the search speed for a target audio clip in a large audio database in a multiprocessor system may be significantly improved. One experiment shows that the search speed for a 15-second target audio clip in a 27-hour audio stream increases by 11 times on a 16-way Unisys system, compared to a serial search of the same audio stream for the same target audio clip.

In one embodiment, a modified search strategy may be used. Under this strategy, a preliminary model (e.g., a CCGMM) may be established for the first K frames (K>=1) of the target audio clip, along with a full model for the entire target audio clip. Accordingly, a preliminary model (e.g., a CCGMM) may first be established for the first K frames (K>=1) of an audio segment. During the active search, the preliminary model of the first K frames of each audio segment may first be compared with the preliminary model of the first K frames of the target audio clip to produce a preliminary similarity measure. If the preliminary similarity measure indicates that these two preliminary models are significantly similar, a full model may be established for the entire audio segment and compared with the full model of the entire target audio clip; otherwise, no full model is established for the audio segment, and the next segment is searched by first establishing a preliminary model for its first K frames and comparing this preliminary model with the preliminary model of the target audio clip. This modified search strategy may further reduce the computation load.
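
This coarse-to-fine screening can be sketched as a thin wrapper around the earlier helpers. K and the preliminary threshold below are hypothetical values, and in practice the target clip's preliminary and full models would be computed once up front rather than on every call.

```python
def two_stage_match(segment_features, target_features, means, variances,
                    universal_weights, k=10, prelim_threshold=0.3,
                    full_threshold=0.05):
    """Screen a segment with a preliminary model of its first K frames;
    build the full model only when the preliminary models are close."""
    prelim_seg = ccgmm_weights(segment_features[:k], means, variances,
                               universal_weights)
    prelim_tgt = ccgmm_weights(target_features[:k], means, variances,
                               universal_weights)
    if kl_max_distance(prelim_seg, prelim_tgt) > prelim_threshold:
        return False                  # cheap rejection; skip the full model
    full_seg = ccgmm_weights(segment_features, means, variances,
                             universal_weights)
    full_tgt = ccgmm_weights(target_features, means, variances,
                             universal_weights)
    return kl_max_distance(full_seg, full_tgt) < full_threshold
```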

Although an example embodiment of the disclosed subject matter is described with reference to block and flow diagrams in FIGS. 1-8, persons of ordinary skill in the art will readily appreciate that many other methods of implementing the disclosed subject matter may alternatively be used. For example, the order of execution of the blocks in flow diagrams may be changed, and/or some of the blocks in block/flow diagrams described may be changed, eliminated, or combined.

In the preceding description, various aspects of the disclosed subject matter have been described. For purposes of explanation, specific numbers, systems and configurations were set forth in order to provide a thorough understanding of the subject matter. However, it is apparent to one skilled in the art having the benefit of this disclosure that the subject matter may be practiced without the specific details. In other instances, well-known features, components, or modules were omitted, simplified, combined, or split in order not to obscure the disclosed subject matter.

Various embodiments of the disclosed subject matter may be implemented in hardware, firmware, software, or a combination thereof, and may be described by reference to or in conjunction with program code, such as instructions, functions, procedures, data structures, logic, application programs, design representations or formats for simulation, emulation, and fabrication of a design, which when accessed by a machine results in the machine performing tasks, defining abstract data types or low-level hardware contexts, or producing a result.

For simulations, program code may represent hardware using a hardware description language or another functional description language which essentially provides a model of how designed hardware is expected to perform. Program code may be assembly or machine language, or data that may be compiled and/or interpreted. Furthermore, it is common in the art to speak of software, in one form or another, as taking an action or causing a result. Such expressions are merely a shorthand way of stating execution of program code by a processing system which causes a processor to perform an action or produce a result.

Program code may be stored in, for example, volatile and/or non-volatile memory, such as storage devices and/or an associated machine readable or machine accessible medium including solid-state memory, hard-drives, floppy-disks, optical storage, tapes, flash memory, memory sticks, digital video disks, digital versatile discs (DVDs), etc., as well as more exotic mediums such as machine-accessible biological state preserving storage. A machine readable medium may include any mechanism for storing, transmitting, or receiving information in a form readable by a machine, and the medium may include a tangible medium through which electrical, optical, acoustical or other forms of propagated signals or carrier waves encoding the program code may pass, such as antennas, optical fibers, communications interfaces, etc. Program code may be transmitted in the form of packets, serial data, parallel data, propagated signals, etc., and may be used in a compressed or encrypted format.

Program code may be implemented in programs executing on programmable machines such as mobile or stationary computers, personal digital assistants, set top boxes, cellular telephones and pagers, and other electronic devices, each including a processor, volatile and/or non-volatile memory readable by the processor, at least one input device, and/or one or more output devices. Program code may be applied to the data entered using the input device to perform the described embodiments and to generate output information. The output information may be applied to one or more output devices. One of ordinary skill in the art may appreciate that embodiments of the disclosed subject matter can be practiced with various computer system configurations, including multiprocessor or multiple-core processor systems, minicomputers, mainframe computers, as well as pervasive or miniature computers or processors that may be embedded into virtually any device. Embodiments of the disclosed subject matter can also be practiced in distributed computing environments where tasks may be performed by remote processing devices that are linked through a communications network.

Although operations may be described as a sequential process, some of the operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally and/or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of operations may be rearranged without departing from the spirit of the disclosed subject matter. Program code may be used by or in conjunction with embedded controllers.

While the disclosed subject matter has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments of the subject matter, which are apparent to persons skilled in the art to which the disclosed subject matter pertains are deemed to lie within the scope of the disclosed subject matter.

1. A method for searching an audio database for a target audio clip in a multiprocessor system, comprising: partitioning said audio database into a plurality of groups; establishing a model for said target audio clip; dynamically scheduling said plurality of groups to a plurality of processors in said multiprocessor system; and processing said scheduled groups in parallel by said plurality of processors to search for said target audio clip.
2. The method of claim 1, wherein partitioning said audio database comprises determining a size for each of said plurality of groups, said size being determined to reduce the amount of overlapped computation among said plurality of groups and load imbalance in parallel processing of said plurality of groups.
3. The method of claim 1, wherein establishing a model for said target audio clip comprises extracting a feature vector sequence from said target audio clip and modeling said feature vector sequence based on a Gaussian mixture model (“GMM”), said GMM including a plurality of Gaussian components.
4. The method of claim 3, wherein modeling said feature vector sequence comprises estimating mixture weights for each of said plurality of Gaussian components.
5. The method of claim 1, wherein processing said scheduled groups in parallel comprises: partitioning each of said scheduled groups into at least one segment; and for each segment, extracting a feature vector sequence for the segment, and modeling said feature vector sequence based on a Gaussian mixture model (“GMM”), said GMM including a plurality of Gaussian components.
6. The method of claim 5, wherein each of said at least one segment has the same length in time as that of said target audio clip.
7. The method of claim 5, wherein if there is more than one segment in an audio stream, each segment partially overlaps with a segment that immediately precedes that segment.
8. The method of claim 5, wherein said plurality of Gaussian components are common for different segments and said target audio clip.
9. The method of claim 8, wherein modeling said feature vector sequence comprises estimating mixture weights for each of said plurality of Gaussian components.
10. The method of claim 9, further comprising: for each segment, computing a Kullback-Leibler (“KL”) distance between a GMM of said segment and a GMM of said target audio clip; and determining that said segment matches said target audio clip if said KL distance is smaller than a predetermined threshold.
11. The method of claim 10, further comprising skipping processing of a number of segments if said KL distance is larger than a predetermined value, said number of segments being dependent on the value of said KL distance.
12. The method of claim 1, wherein said multiprocessor system comprises a memory shared by said plurality of processors.
13. An apparatus for searching an audio database for a target audio clip in a multiprocessor system, comprising: a partitioning module to partition said audio database into a plurality of groups; a scheduler to dynamically schedule said plurality of groups to a plurality of processors in said multiprocessor system; and an audio searching module for each of said plurality of processors to process said scheduled groups in parallel by said plurality of processors to search for said target audio clip.
14. The apparatus of claim 13, wherein said partitioning module further determines a size for each of said plurality of groups, said size being determined to reduce the amount of overlapped computation among said plurality of groups and load imbalance in parallel processing of said plurality of groups.
15. The apparatus of claim 13, wherein an audio searching module comprises: a feature extractor to partition an input audio stream into at least one segment and to extract a feature vector sequence from each of said at least one segment, said at least one segment having the same length in time as that of said target audio clip; and a modeling module to model said feature vector sequence for each segment based on a Gaussian mixture model (“GMM”), said GMM including a plurality of Gaussian components, said plurality of Gaussian components being common among all of the segments.
16. The apparatus of claim 15, wherein one of said audio searching modules further processes said target audio clip by extracting a feature vector sequence from said target audio clip and by modeling said feature vector sequence using said GMM, said GMM including a plurality of Gaussian components common for said target audio clip and segments of said input audio stream.
17. The apparatus of claim 16, wherein an audio searching module further comprises a decision maker to compute a Kullback-Leibler (“KL”) distance between a GMM of a segment of said input audio stream and a GMM of said target audio clip, and to determine whether said segment matches said target audio clip based on said KL distance.
18. The apparatus of claim 17, wherein said decision maker further determines how many segments are to be skipped from processing based on said KL distance.
19. An article comprising a machine-readable medium that contains instructions, which when executed by a processing platform, cause said processing platform to perform operations for searching an audio database for a target audio clip in a multiprocessor system, the operations comprising: partitioning said audio database into a plurality of groups; establishing a model for said target audio clip; dynamically scheduling said plurality of groups to a plurality of processors in said multiprocessor system; and processing said scheduled groups in parallel by said plurality of processors to search for said target audio clip.
20. The article of claim 19, wherein partitioning said audio database comprises determining a size for each of said plurality of groups, said size being determined to reduce the amount of overlapped computation among said plurality of groups and load imbalance in parallel processing of said plurality of groups.
21. The article of claim 19, wherein establishing a model for said target audio clip comprises extracting a feature vector sequence from said target audio clip and modeling said feature vector sequence based on a Gaussian mixture model (“GMM”), said GMM including a plurality of Gaussian components.
22. The article of claim 21, wherein modeling said feature vector sequence comprises estimating mixture weights for each of said plurality of Gaussian components.
23. The article of claim 19, wherein processing said scheduled groups in parallel comprises: partitioning each of said scheduled groups into at least one segment; and for each segment, extracting a feature vector sequence for the segment, and modeling said feature vector sequence based on a Gaussian mixture model (“GMM”), said GMM including a plurality of Gaussian components.
24. The article of claim 23, wherein each of said at least one segment has the same length in time as that of said target audio clip.
25. The article of claim 23, wherein if there is more than one segment in an audio stream, each segment partially overlaps with a segment that immediately precedes that segment.
26. The article of claim 23, wherein said plurality of Gaussian components are common for different segments and said target audio clip.
27. The article of claim 26, wherein modeling said feature vector sequence comprises estimating mixture weights for each of said plurality of Gaussian components.
28. The article of claim 27, wherein said operations further comprise: for each segment, computing a Kullback-Leibler (“KL”) distance between a GMM of said segment and a GMM of said target audio clip; and determining that said segment matches said target audio clip if said KL distance is smaller than a predetermined threshold.
29. The article of claim 28, wherein said operations further comprise skipping processing of a number of segments if said KL distance is larger than a predetermined value, said number of segments being dependent on the value of said KL distance.
30. The article of claim 19, wherein said multiprocessor system comprises a memory shared by said plurality of processors.